Abstract. We introduce a relative variant of information
loss to characterize the behavior of deterministic input-output 
systems. We show that the relative loss is closely related to 
Rényi's information dimension. We provide an upper bound for
continuous input random variables and an exact result for a 
class of functions (comprising quantizers) with infinite absolute 
information loss. A connection between relative information loss 
and reconstruction error is investigated. 

I. Introduction 

System theory provides a vast literature of mathematical 
descriptions of deterministic input-output systems. The gain 
of a linear system at a specific input frequency is specified by 
its transfer function, and for the distortion introduced by non- 
linear components certain single-letter measures (e.g., signal- 
to-distortion ratio) have been defined. These and the measures 
introduced for the design of systems (e.g., the mean-squared 
error) give ample choice to the engineer to characterize a 
system at hand. However, most of the available descriptions are 
energy-centered or consider second-order statistics only. A notable
exception is the description of chaotic, autonomous dynamical
systems [1].

Recently, however, we have observed a trend toward
information-theoretic descriptions and cost functions, especially
in machine learning and nonlinear adaptive systems [2].
We believe that system theory would also benefit from single- 
letter information-theoretic characterizations of deterministic 
input-output systems, and thus have introduced information 
loss as a possible candidate in [3]. In this work we complement
the notion of absolute information loss with its relative version, 
in order to provide a meaningful measure in cases where the 
absolute information loss is infinite. 

Relative information loss for static functions, or fractional 
information loss, has already been introduced by Watanabe [4]
in the context of stationary stochastic processes on finite 
alphabets. It is also worth mentioning that a rather similar 
quantity has been used in [5], denoted as information gain
ratio:

$$\frac{I(C; A)}{H(A)} \tag{1}$$



There, A is an attribute with a finite set of values, C is a class 
variable, and the value of A for which this measure achieves 
its maximum is assumed to be the most appropriate root of 
a decision tree used for classification. In this work we will 
consider the quantity 



$$1 - \frac{I(C; A)}{H(A)} = \frac{H(A|C)}{H(A)} \tag{2}$$



and extend its definition to a larger class of random variables. 
The paper is organized as follows: We define relative information
loss in Section II and analyze its elementary properties
in Section III. Section IV is devoted to a class of deterministic
systems for which the absolute loss was shown to be infinite.
We present a bound for the probability of a reconstruction error
in Section V and conclude with a few examples in Section VI.

II. A Definition of Relative Information Loss

We start by recalling the definition given in [3], where
the absolute information loss induced by transforming an
N-dimensional random variable (RV) X to another N-dimensional
RV Y by a static function $g: \mathcal{X} \to \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^N$,
was given as

$$L(X \to Y) = \sup_{\mathcal{P}} \left( I(\hat{X}; X) - I(\hat{X}; Y) \right) = H(X|Y) \tag{3}$$

where the supremum is over all partitions $\mathcal{P}$ of $\mathcal{X}$, and where
$\hat{X}$ is obtained by quantizing X according to the partition $\mathcal{P}$
(see Fig. 1).

It was shown in [3] that there exist functions which
lose an infinite amount of information; in particular, if the
probability measure $P_X$ is absolutely continuous w.r.t. the
N-dimensional Lebesgue measure ($P_X \ll \mu^N$), quantizers,
limiters, and mappings to subspaces of lower dimensionality
suffer from infinite information loss. Since some of these
functions also transfer an infinite amount of information (i.e.,
$I(X; Y) = \infty$), information loss alone does not
suffice to fully characterize the function g in information-theoretic
terms.

Thus, we complement this absolute quantity of information 
loss by a relative one, indicating the percentage of information 
lost in the function: 

Definition 1. Let X be an N-dimensional RV on the sample
space $\mathcal{X}$, and let Y be obtained by transforming X with a static
function g. We define the relative information loss induced by
this transform as

$$l(X \to Y) = \lim_{n \to \infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{4}$$

where $\hat{X}_n = \lfloor nX \rfloor / n$ (elementwise). The quantity on the left is
defined if the limit on the right-hand side exists.

Fig. 1. Model for computing the information loss of a memoryless input-output
system g. Q is a quantizer with partition $\mathcal{P}$.

One can consider $\hat{X}_n$ as being obtained by a vector quantization
of X with quantization bins equal to N-dimensional
hypercubes of side length $1/n$ (i.e., using a uniform partition
$\mathcal{P}_n$). Note that the partition $\mathcal{P}_{2^{k+1}}$ is a refinement of $\mathcal{P}_{2^k}$
($\mathcal{P}_{2^{k+1}} \succ \mathcal{P}_{2^k}$).
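The quantization step of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quantize`, `bin_index`, and `refines` are hypothetical and not part of the original work, and n is restricted to powers of two so that the floating-point arithmetic is exact.

```python
import math

def quantize(x, n):
    """Elementwise uniform quantizer: maps each coordinate to floor(n*x)/n."""
    return tuple(math.floor(n * xi) / n for xi in x)

def bin_index(x, n):
    """Integer index of the hypercube of side 1/n that contains x."""
    return tuple(math.floor(n * xi) for xi in x)

def refines(x, k):
    """Check that the bin of x at n = 2**(k+1) lies inside its bin at n = 2**k."""
    fine = bin_index(x, 2 ** (k + 1))
    coarse = bin_index(x, 2 ** k)
    # Halving a fine index (floor division) must give the coarse index.
    return tuple(i // 2 for i in fine) == coarse

if __name__ == "__main__":
    x = (0.3, -1.7)
    for k in range(4):
        print(2 ** k, quantize(x, 2 ** k), refines(x, k))
```

The nesting check mirrors the refinement property of the dyadic partitions: every hypercube of side $2^{-(k+1)}$ is contained in exactly one hypercube of side $2^{-k}$.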

Remark: First of all, the limit of a sequence of increasingly
fine quantizations now takes the place of the supremum in (3).
(In Definition 1, the supremum would lead to $l(X \to Y) = 1$.)
Alternatively, as it was shown in [3], the limit of this sequence
can also be used in the definition of absolute information loss,
i.e.,

$$L(X \to Y) = \lim_{n \to \infty} H(\hat{X}_n|Y) = H(X|Y). \tag{5}$$

Again, this only holds if the limit exists.

III. Elementary Properties of Relative 
Information Loss 

We will now highlight the basic properties of relative
information loss: First of all, $l(X \to Y) \in [0, 1]$, which
is due to the non-negativity of entropy and the fact that
conditioning reduces entropy. It is interesting to note, however,
that $l(X \to Y) = 0$ does not imply that the function g is
information lossless, i.e., that $L(X \to Y) = 0$. While this
holds for discrete RVs X (with finite entropy H(X)), for RVs
with a continuous component this only means that the absolute
information loss is finite. Conversely, $l(X \to Y) = 1$ does not
imply that the information transfer I(X; Y) is zero. Again,
while this holds for discrete RVs, for RVs with a continuous
component $l(X \to Y) = 1$ implies a finite information
transfer. However, we can state the following

Proposition 1. Let X be such that $H(X) = \infty$ and let
$l(X \to Y) > 0$. Then, $L(X \to Y) = H(X|Y) = \infty$.

Proof: We prove the proposition by contradiction. To this
end, assume that $H(X|Y) = L < \infty$. Then,

\begin{align}
l(X \to Y) &= \lim_{n \to \infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} = \liminf_{n \to \infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{6} \\
&\le \liminf_{n \to \infty} \frac{H(X|Y)}{H(\hat{X}_n)} \tag{7} \\
&= \liminf_{n \to \infty} \frac{L}{H(\hat{X}_n)} = 0 \tag{8}
\end{align}

where the inequality is due to data processing. The last
equality follows from the fact that at least a subsequence of
$H(\hat{X}_n)$ converges to $H(X) = \infty$ (cf. [?], [?]). This contradicts
the assumption $l(X \to Y) > 0$. ■

Another interesting property of the sequence

$$\frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{9}$$

is that, while it might be converging (as we will show in
some practically relevant cases below), it is in general neither
increasing nor decreasing. Consider, for example, a function
g which is bijective if restricted to elements of the partition
$\{\mathcal{X}_i\}$, but non-injective on its domain (cf. [?]). Thus, for
a partition $\mathcal{P}_{n_0} \succ \{\mathcal{X}_i\}$ and an input probability measure
$P_X \ll \mu^N$ the sequence in (9) is decreasing for all further
refinements. Conversely, let g be a vector quantizer with
partition $\{\mathcal{X}_i\}$ and let $\mathcal{P}_{n_0} = \{\mathcal{X}_i\}$. Here, while $H(\hat{X}_{n_0}|Y) = 0$,
the sequence in (9) increases for all further refinements
(cf. Section VI-A).

Definition 1 has an interesting relationship to the ε-entropy
proposed by Kolmogorov in [?], [?], but an even
tighter connection can be made to the information dimension
proposed by Rényi in [9]. From there, we restate

Lemma 1 (Asymptotic behavior of $H(\hat{X}_n)$). Let X be an RV
with existing information dimension d(X) and let $H(\hat{X}_1) < \infty$.
Then, for $n \to \infty$ the entropy of the RV $\hat{X}_n$ quantized as
in Definition 1 behaves as

$$H(\hat{X}_n) = d(X) \log n + h + o(1) \tag{10}$$

where h is the d(X)-dimensional entropy of X (provided it
exists).

Proof: See [9] (cf. also [?], [8]). ■

For an absolutely continuous RV X we obtain from this 
Lemma the following 

Corollary 1 (Theorems 1 & 4 in [9]). Let X be an RV with
$P_X \ll \mu^N$ and $H(\hat{X}_1) < \infty$. Then, for $n \to \infty$ the entropy
behaves as

$$H(\hat{X}_n) = N \log n + h(X) + o(1) \tag{11}$$

where h(·) is the differential entropy of X (provided it exists).

In other words, as a first approximation, the entropy of a
continuous RV depends on the dimension of its probability
measure, and only as a second approximation on the shape of
its density. Note that the second and the third term in Lemma 1
can be neglected for large n, since they are dominated by the
first term.
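The asymptotic behavior stated in Corollary 1 can be checked numerically for a simple scalar case. The following sketch assumes, purely for illustration, an RV X with density f(x) = 2x on [0, 1), for which the bin probabilities and the differential entropy h(X) = 1/2 - ln 2 (in nats) are available in closed form; the names `quantized_entropy` and `H_DIFF` are hypothetical.

```python
import math

def quantized_entropy(n):
    """Entropy (in nats) of floor(n*X)/n when X has density f(x) = 2x on [0, 1)."""
    h = 0.0
    for k in range(n):
        # P(k/n <= X < (k+1)/n) = ((k+1)^2 - k^2) / n^2 from the CDF F(x) = x^2.
        p = (2 * k + 1) / n ** 2
        h -= p * math.log(p)
    return h

# Differential entropy of f(x) = 2x on [0, 1): h(X) = 1/2 - ln 2 (nats).
H_DIFF = 0.5 - math.log(2)

if __name__ == "__main__":
    for n in (4, 64, 1024, 16384):
        # The gap H(X_n) - log n should approach h(X) as n grows.
        gap = quantized_entropy(n) - math.log(n)
        print(n, round(gap, 6), round(H_DIFF, 6))
```

In this example the difference between the quantized entropy and log n settles at h(X), while the ratio of the quantized entropy to log n tends to the dimension N = 1 only slowly, because the correction term decays like 1/log n.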

Using these results we now state

Theorem 1. Let X be an RV with positive information
dimension. Then, if $d(X|Y = y)$ exists for all $y \in \mathcal{Y}$, the
relative information loss equals

$$l(X \to Y) = \frac{E_Y\{d(X|Y = y)\}}{d(X)} \tag{12}$$

where $E_Y\{\cdot\}$ denotes the expectation w.r.t. Y.



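Proof: For the proof we use the definition of information
dimension given in [9],

$$d(X) = \lim_{n \to \infty} \frac{H(\hat{X}_n)}{\log n} \tag{13}$$

where by assumption the limit exists. We obtain

\begin{align}
\frac{E_Y\{d(X|Y = y)\}}{d(X)} &= \frac{\int_{\mathcal{Y}} \lim_{n \to \infty} \frac{H(\hat{X}_n|Y = y)}{\log n} \, dP_Y(y)}{\lim_{n \to \infty} \frac{H(\hat{X}_n)}{\log n}} \tag{14} \\
&\overset{(a)}{=} \frac{\lim_{n \to \infty} \int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y = y)}{\log n} \, dP_Y(y)}{\lim_{n \to \infty} \frac{H(\hat{X}_n)}{\log n}} \tag{15} \\
&\overset{(b)}{=} \lim_{n \to \infty} \frac{\int_{\mathcal{Y}} H(\hat{X}_n|Y = y) \, dP_Y(y)}{H(\hat{X}_n)} \tag{16} \\
&= \lim_{n \to \infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{17} \\
&= l(X \to Y) \tag{18}
\end{align}

where in (a) we used Lebesgue's dominated convergence
theorem (e.g., [10]) and where (b) results from the fact that,
by assumption, the limits in the numerator and denominator
exist and are finite. ■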

This tight connection between relative information loss 
and the ratio of information dimensions leads to a series of 
interesting insights, as we will show in this and a companion 
paper [11]. In particular, it will prove useful if the probability
measures are absolutely continuous w.r.t. Lebesgue measure,
as information and geometric dimension coincide in this case
(cf. [9]).

We are now ready to establish an upper bound on the relative 
information loss in the following 

Theorem 2. Let X be an RV with a probability measure
$P_X \ll \mu^N$ and with $H(\hat{X}_1) < \infty$. Then, if the quantities
on the right exist,

$$l(X \to Y) \le \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)} \to Y) \le \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)} \to Y^{(i)}) \tag{19}$$

where $X^{(i)}$ and $Y^{(i)}$ are the components of X and Y, respectively.

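Proof: With Definition 1 and the chain rule of entropy
we get

\begin{align}
l(X \to Y) &= \lim_{n \to \infty} \frac{\sum_{i=1}^{N} H(\hat{X}_n^{(i)} | \hat{X}_n^{(1)}, \dots, \hat{X}_n^{(i-1)}, Y)}{H(\hat{X}_n)} \tag{20} \\
&\le \lim_{n \to \infty} \frac{\sum_{i=1}^{N} H(\hat{X}_n^{(i)} | Y)}{H(\hat{X}_n)} \notag \\
&\overset{(a)}{=} \frac{1}{d(X)} \sum_{i=1}^{N} E_Y\{d(X^{(i)} | Y = y)\} \tag{21}
\end{align}

where in (a) we exchanged summation and limit for similar
reasons as in the proof of Theorem 1. Corollary 1 now tells us
that due to the absolute continuity $d(X) = N$ and $d(X^{(i)}) = 1$
for all i. We thus obtain

$$l(X \to Y) \le \frac{1}{N} \sum_{i=1}^{N} \frac{E_Y\{d(X^{(i)} | Y = y)\}}{d(X^{(i)})} = \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)} \to Y) \tag{22}$$

which proves the first inequality. The second inequality is
obtained by bounding $H(\hat{X}_n^{(i)} | Y) \le H(\hat{X}_n^{(i)} | Y^{(i)})$ in the
derivation above. ■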

At this point it is worth noting that throughout Section III
no assumptions about a functional dependence between X and
Y were made. Indeed, all statements made in this Section
hold equally for stochastic relationships (including stochastic
independence) between X and Y.



IV. Relative Information Loss for Functions 
which are Constant 

We now apply the relative information loss of Definition 1
to a class of functions for which we showed in [3] that
the absolute information loss is infinite. In particular, we are
talking about functions which are constant on subsets $A_i$ of
the domain with positive probability measure.

To this end, let $P_X \ll \mu^N$ be concentrated on a compact set
$\mathcal{X} \subset \mathbb{R}^N$. Let further $A_i \subset \mathcal{X}$ with $P_X(A_i) > 0$. Without loss
of generality, we assume that the subsets $A_i$ are disjoint. Now
take Y = g(X), where $g: \mathcal{X} \to \mathcal{Y}$, $\mathcal{Y} \subset \mathbb{R}^N$, is surjective,
measurable, and constant on $A_i$, i.e., $g(A_i) = y_i$. As a
consequence, $P_Y$ is atomic on $\{y_i\}$ (thus, $L(X \to Y) = \infty$;
cf. [?, Corollary 2]). With $A = \bigcup_i A_i$ we further require that
g is piecewise bijective¹ on $\mathcal{X} \setminus A$, from which follows that
$P_Y$ is absolutely continuous on $\mathcal{Y} \setminus \{y_i\}$.

We can now state the following 

Proposition 2. Let X be an RV with probability measure
$P_X \ll \mu^N$ concentrated on a compact set $\mathcal{X} \subset \mathbb{R}^N$. Let g be
such that it is constant on sets $A_i$ of positive $P_X$-measure and
piecewise bijective elsewhere. Then, the relative information
loss is

$$l(X \to Y) = P_X(A) \tag{23}$$

where $A = \bigcup_i A_i$.

Proof: By assumption and Corollary 1 we have d(X) = N.
Due to the properties of the function g, $P_Y$ decomposes
into a component that is absolutely continuous w.r.t. $\mu^N$ and an
atomic component concentrated on the points $y_i = g(A_i)$. Thus, the preimage
of points $y_i$ with positive $P_Y$-measure is the union of the
set $A_i$ and a countable number of points $\{x_{ij}\}$. Since the
set $A_i$ has positive $P_X$-measure (otherwise $P_Y(y_i) = 0$),
the conditional probability measure $P_{X|Y=y_i} \ll \mu^N$. Due
to the compactness of the support $\mathcal{X}$ the conditional entropy
$H(\hat{X}_1|Y = y_i) < \infty$, thus the associated information dimension
exists and equals N. For all other points $y \in \mathcal{Y} \setminus \{y_i\}$
the preimage is a countable union of points. The associated
conditional probability measure is 0-dimensional.

¹ See [3] for a possible definition.

We now prove this Proposition with the help of Theorem 1:

\begin{align}
l(X \to Y) &= \frac{1}{N} \int_{\mathcal{Y}} d(X|Y = y) \, dP_Y(y) \tag{24} \\
&= \frac{1}{N} \int_{\mathcal{Y} \setminus \{y_i\}} d(X|Y = y) \, dP_Y(y) + \frac{1}{N} \sum_i d(X|Y = y_i) P_Y(y_i) \tag{25} \\
&= \sum_i P_Y(y_i). \tag{26}
\end{align}

Since the preimage of $y_i$ under g consists of a set $A_i$ of
positive $P_X$-measure and (zero-measure) points, we can write

$$l(X \to Y) = \sum_i P_Y(y_i) = \sum_i P_X(A_i) = P_X(A) \tag{27}$$

where the last equality follows from the fact that the $A_i$ are
disjoint and from the additivity of the measure $P_X$. ■

The interesting implication of this result is that the shape
of the PDF on A has no influence on the relative loss, and
neither has the number of different sets $A_i$ (with different
output values $y_i$); yet, all these do have an influence on
the information transfer I(X; Y). This is in conflict with
intuition, which suggests that whatever influences information
transfer should also influence information loss, and, thus, also
relative information loss. Yet, both the properties in Section III
and the fact that $H(\hat{X}_n)$, as a first approximation, depends
more on the dimension and the quantization bin size than on
the shape of the PDF [7] confirm this theoretical result.

Furthermore, in this particular case it turns out that Y is a
mixture of a continuous and a discrete RV with information
dimension $1 - P_X(A)$ [?, (12)]. One is thus led to the
conjecture that under some circumstances one can show
that

$$l(X \to Y) = 1 - \frac{d(Y)}{d(X)}. \tag{28}$$

Whether this holds, and under which conditions, is
currently under investigation.

V. Relative Information Loss and Reconstruction 

Error 

We next want to find connections between the relative 
information loss and the probability of a reconstruction error 
given by 

$$P_e = \min_f \Pr(X \neq f(Y)) \tag{29}$$

where f is a function that tries to estimate or reconstruct
the original X from its image Y. It is well known that
Fano's inequality does not hold for countably infinite alphabets
(e.g., [13]). However, we employ Fano's inequality here to
derive a relationship between relative information loss and the 
probability of a reconstruction error by starting from a finite 
alphabet and then taking the limit. We present 



Theorem 3. Let X be an RV with a probability measure $P_X \ll \mu^N$
which is concentrated on a compact set $\mathcal{X} \subset \mathbb{R}^N$. Let $P_e$
denote the probability of a reconstruction error. Then, the error
probability is bounded by the relative information loss from
below, i.e.,

$$P_e \ge l(X \to Y). \tag{30}$$

Proof: For the proof we start with a quantized version
of the input RV, $\hat{X}_n$. Since $\hat{X}_n$ is a discrete RV on a finite
alphabet $\hat{\mathcal{X}}_n$, we can employ the standard Fano bound [14],

$$H(\hat{X}_n|Y) \le H_2(P_{e,n}) + P_{e,n} \log \mathrm{card}(\hat{\mathcal{X}}_n) \tag{31}$$

where

$$P_{e,n} = \Pr(\hat{X}_n \neq f^*(Y)). \tag{32}$$

Since Fano's inequality holds for arbitrary estimators $f^*$, we
let $f^*$ be the composition of $f^\circ = \arg\min_f \Pr(X \neq f(Y))$
and the quantizer of Definition 1. $P_{e,n}$ is the probability that
$f^\circ(Y)$ and X do not lie in the same quantization bin. Since
the bin volume reduces with n, $P_{e,n}$ increases monotonically
to $P_e$. We thus obtain with $H_2(p) \le 1$ for all $0 \le p \le 1$

$$H(\hat{X}_n|Y) \le 1 + P_e \log \mathrm{card}(\hat{\mathcal{X}}_n). \tag{33}$$

We next define the diameter D of $\mathcal{X}$ as

$$D = \sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\| \tag{34}$$

where $\|\cdot\|$ is the Euclidean distance and where $D < \infty$
due to the compactness of $\mathcal{X}$. As an immediate consequence,
$\mathcal{X}$ can be covered by an N-dimensional hypercube with side
length D. Quantizing X with a vector quantizer corresponds
to covering $\mathcal{X}$ by hypercubes of side length $1/n$. It thus follows
that

$$\mathrm{card}(\hat{\mathcal{X}}_n) \le (\lceil nD \rceil)^N \le (nD + 1)^N \tag{35}$$

and finally

$$H(\hat{X}_n|Y) \le 1 + P_e N \log(nD + 1). \tag{36}$$

With Corollary 1 we thus get

\begin{align}
l(X \to Y) &= \lim_{n \to \infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{37} \\
&\le \lim_{n \to \infty} \frac{1 + P_e N \log(nD + 1)}{H(\hat{X}_n)} \tag{38} \\
&\overset{(a)}{=} \lim_{n \to \infty} \frac{1}{N \log n} + \lim_{n \to \infty} \frac{P_e N \log(nD + 1)}{N \log n} \tag{39} \\
&= P_e \tag{40}
\end{align}

where in (a) we again used Theorem 1 and the fact that
d(X) = N. This completes the proof. ■
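The last step of the proof can be illustrated numerically. The sketch below evaluates the right-hand side of (38) with $H(\hat{X}_n)$ replaced by its leading term N log n from Corollary 1; the function name `fano_ratio` and the parameter values (P_e = 0.3, N = 2, D = 1.5) are arbitrary assumptions chosen only for illustration.

```python
import math

def fano_ratio(n, p_e, dim, diameter):
    """(1 + P_e * N * log2(n*D + 1)) / (N * log2(n)), the bound (38) with
    H(X_n) approximated by its leading term N * log2(n)."""
    numerator = 1.0 + p_e * dim * math.log2(n * diameter + 1)
    return numerator / (dim * math.log2(n))

if __name__ == "__main__":
    for exponent in (2, 8, 32, 128):
        n = 10 ** exponent
        # The printed values decrease toward p_e as n grows.
        print(exponent, fano_ratio(n, p_e=0.3, dim=2, diameter=1.5))
```

The values approach P_e from above, but only at a logarithmic rate, which is why the bound is stated as a limit rather than for any finite quantization.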

VI. Examples 

In this Section we illustrate the theoretical results
by means of a few examples.



Fig. 2. Center clipper of Example 2 (input-output characteristic g(x)).



A. Quantizers 

The first practical application of our results will be the 
analysis of a quantizer, which is typically used to represent a 
continuous RV by a discrete RV, designed according to some 
optimality criterion (mean-squared reconstruction error, max- 
imum output entropy, etc.). Since the output of the quantizer 
is discrete in amplitude, it is clear that an infinite amount 
of information is lost. In addition to that, since the quantizer
function is constant on each of its quantization bins (so that
$P_X(A) = 1$ in Proposition 2), it turns out that the
relative information loss is unity:

$$l(X \to Y) = 1. \tag{41}$$

In other words, irrespective of the (finite) number of quantization
bins and the design criterion, the quantizer always
destroys 100% of the available information. This holds equally
for scalar and vector quantizers. Note, however, that despite
this fact a positive amount of information is still transferred
by the quantizer (cf. Section III).
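As a rough numerical illustration of the unit relative loss, the following sketch evaluates the ratio in Definition 1 for a scalar example. It assumes X uniform on [0, 1) and a uniform quantizer g whose number of bins divides n; these choices, as well as the names `entropy` and `quantizer_loss_ratio`, are illustrative assumptions and not taken from the original work.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def quantizer_loss_ratio(n, num_bins):
    """H(X_n | Y) / H(X_n) for X uniform on [0, 1) and a uniform quantizer.

    Assumes num_bins divides n, so every cell of side 1/n lies in exactly one bin.
    """
    if n % num_bins != 0:
        raise ValueError("num_bins must divide n")
    h_x = entropy([1.0 / n] * n)  # H(X_n) = log2(n)
    cells_per_bin = n // num_bins
    # Given Y, X_n is uniform over the cells of that bin; each bin has mass 1/num_bins.
    h_x_given_y = sum(
        (1.0 / num_bins) * entropy([1.0 / cells_per_bin] * cells_per_bin)
        for _ in range(num_bins)
    )
    return h_x_given_y / h_x

if __name__ == "__main__":
    for k in (3, 6, 12, 18):
        print(2 ** k, quantizer_loss_ratio(2 ** k, num_bins=8))
```

For this setup the ratio equals 1 - log2(num_bins) / log2(n): it starts at zero when the partition of Definition 1 coincides with the quantizer bins and increases toward one under further refinement.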

B. Center Clipper 

In signal processing, center clippers (see Fig. 2) are used
for noise suppression or residual echo cancellation [15]. We
let the center clipper be described by the following function:

$$g(x) = \begin{cases} x, & \text{if } |x| > c \\ 0, & \text{otherwise.} \end{cases} \tag{42}$$

By Proposition 2 the relative information loss evaluates to
$l(X \to Y) = P_X([-c, c])$, which reveals that it depends only on the
clipping parameter c and the probability mass contained in that
interval. Yet, since center clippers do enhance signal quality
in many cases, this suggests that a different measure
of information loss may be more appropriate for such systems.

Note further that the center clipper is bijective if it is
restricted to $\mathcal{X} \setminus [-c, c]$. Thus, while outside of $[-c, c]$ we
have a zero probability of a reconstruction error, within the
center interval the error probability is unity. As a consequence,
$P_e = P_X([-c, c])$, which makes the bound of Theorem 3 tight.
If g were not bijective outside of $[-c, c]$ but, e.g., destroyed
the sign information, then $P_e > P_X([-c, c])$ and Theorem 3
would still hold.
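The convergence of the ratio in Definition 1 to the clipped probability mass can be sketched numerically. The example assumes X uniform on [-1, 1), a clipper that maps [-c, c] to zero and is the identity elsewhere, and values of n for which c·n is an integer; the names `entropy` and `clipper_loss_ratio` are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def clipper_loss_ratio(n, c):
    """H(X_n | Y) / H(X_n) for X uniform on [-1, 1) and a center clipper.

    Assumes c*n is an integer, so [-c, c) is a union of cells of side 1/n.
    """
    cells_inside = round(2 * c * n)      # cells of side 1/n covering [-c, c)
    total_cells = 2 * n                  # cells covering [-1, 1)
    p_clip = cells_inside / total_cells  # P_X([-c, c]) = c for uniform X
    h_x = entropy([1.0 / total_cells] * total_cells)
    # Y = 0 leaves X_n uniform over the clipped cells; any other Y determines X.
    h_x_given_y = p_clip * entropy([1.0 / cells_inside] * cells_inside)
    return h_x_given_y / h_x

if __name__ == "__main__":
    for k in (4, 8, 12, 16):
        print(2 ** k, clipper_loss_ratio(2 ** k, c=0.25))
```

With c = 0.25 and n = 2^k the ratio equals 0.25 (k - 1) / (k + 1), so it increases toward the clipped mass 0.25 but, as with the quantizer, only at a logarithmic rate in n.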



VII. Conclusion 

In this work, we introduced the notion of relative informa- 
tion loss, complementing its absolute variant presented by the 
authors in a previous work. We showed that there is a close
connection between the relative loss and the Rényi information
dimension of the input and the conditional random variable of
the input given the output.

For a continuous-valued input both upper bounds and an
exact expression for a certain class of systems were presented.
In particular, it was shown that quantizers lose 100% of
the available information. We finally analyzed a connection 
between the probability of reconstruction error and relative 
information loss. 
Appendix
We now show the following

Lemma 2.

$$\lim_{n \to \infty} H(\hat{X}_n|Y) = H(X|Y) \tag{43}$$

provided the limit exists.

Proof: For the proof we note that

$$H(\hat{X}_n|Y) = I(X; \hat{X}_n|Y) \tag{44}$$

because $\hat{X}_n$ is a function of X [6, Ch. 3.9]. Further, if $\xi = (\xi_1, \xi_2, \dots)$
we obtain with [6, Thm. 3.10.1]

$$\lim_{n \to \infty} I((\xi_1, \xi_2, \dots, \xi_n); \eta | \epsilon) = I(\xi; \eta | \epsilon). \tag{45}$$

We now identify $\epsilon = Y$ and $\eta = X$. Furthermore, if the
limit in Lemma 2 exists, all subsequences converge to the
same limit. In particular, also the subsequence $\hat{X}_{2^k}$ converges
to the same limit. We now identify this RV with the binary
expansion of X up to order k; thus, $\hat{X}_{2^k} = (\xi_1, \xi_2, \dots, \xi_k)$.
Clearly, $\lim_{k \to \infty} \hat{X}_{2^k} = X$. Comparing this to (45) completes
the proof. ■